Introduction

“As I scurried across the candlelit chamber, manuscripts in hand, I thought I’d made it. Nothing would be able to hurt me anymore. Little did I know there was one last fright lurking around the corner.” This is a part of a horror story which terrifies and excites us. In the passage, we are going to analyze the data composed of horror stories written by Edgar Allan Poe, Mary Shelley, and HP Lovecraft. The data was prepared by chunking larger texts into sentences using CoreNLP’s MaxEnt sentence tokenizer. Specifically we would like to consider the similarities and the differences between the texts attributed to each author and study patterns that could be used to characterize the writing styles of the three authors.

Libraries Preparation

packages.used=c("widyr","ggraph","igraph","stringr","scales","spacyr","cleanNLP","readr","stringi","ggplot2","corrplot","dplyr","tidyr","forcats","reshape2","ggridges","corrgram","textstem","tidytext","tm","topicmodels","wordcloud","RSentiment","jpeg")

# check packages that need to be installed.
packages.needed=setdiff(packages.used, 
                        intersect(installed.packages()[,1], 
                                  packages.used))
# install additional packages
if(length(packages.needed)>0){
  install.packages(packages.needed, dependencies = TRUE)
}
library('jpeg')
library('widyr')
library('ggraph')
library('igraph')
library('stringr')
library('scales')
library('spacyr')
library('cleanNLP')
library('readr')
library('stringi')
library('ggplot2') 
library('corrplot') 
library('dplyr') 
library('tidyr') 
library('forcats') 
library('reshape2')
library('ggridges') 
library('corrgram')
library('textstem')
library('tidytext')
library('tm') 
library('topicmodels') 
library('wordcloud')
library("RSentiment")

# Models
source("../lib/multiplot.R")

Data Preparations

First, before we dive into the data. Let’s take a glimpse of the data offered.

spookydata = read.csv('../data/spooky.csv', as.is = TRUE)

Then, we need to pre-precess the text data.

spookydata <- spookydata %>%
  filter(str_detect(text, "^[^>]+[A-Za-z\\d]") | text == ""
         )

Then, let’s find whether the sentence length among the authors will vary much.

p <- spookydata %>%
  mutate(sen_len = str_length(text)) %>%
  ggplot(aes(sen_len, author, fill = author)) +
  geom_density_ridges() +
  scale_x_log10() +
  theme(legend.position = "right") +
  labs(x = "Sentence length")
#jpeg(file="../figs/sentence.jpeg")

#plot(p)
#dev.off()

plot(p)
## Picking joint bandwidth of 0.0414


Looks like the three authors’ sentence length distribution varies. HP Lovecraft prefers long sentence and is more focused on using sentences with length around 200.

Second, let’s do some simple treatment to our data: remove the invalid information incluing tokens. Also, we could do the lemmatization to the words.

spooky_wrd <- lemmatize_words(spookydata) %>%
   unnest_tokens(word, text) %>%
   # remove stopwords
   anti_join(stop_words, by = "word") %>%
   count(author, word) %>%
   ungroup()

Word Clouds

In this part, lets’ make a word cloud to see the most common words used by the three authors together and separately

spooky_wrd_all <- spooky_wrd %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  ungroup()
#jpeg(file="../figs/wordcloud_1.jpeg")

#wordcloud(spooky_wrd_all$word, spooky_wrd_all$n,
#          max.words = 200, scale = c(2.0,0.5),
#          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:10])
#dev.off()
wordcloud(spooky_wrd_all$word, spooky_wrd_all$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:10])


Those words are the most common words in the datasets, we have seen that the authors would like to use the word like “life”,“death”,“door” and “light”. It seems that the horror fictions would like to decorate the normal life with life and death. Naturally, we would assume some diffrence between different authors. Now, let’s see if there is any difference between them.

spooky_wrd_MWS <- spooky_wrd %>%
  filter(author == "MWS") %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  ungroup()
#jpeg(file="../figs/wordcloud_MWS.jpeg")
#wordcloud(spooky_wrd_MWS$word, spooky_wrd_MWS$n,
#          max.words = 200, scale = c(2.0,0.5),
#          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:10])
#dev.off()
wordcloud(spooky_wrd_MWS$word, spooky_wrd_MWS$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:10])


It seems that Mary Shelley is more focused on the human body and relationships.

spooky_wrd_EAP <- spooky_wrd %>%
  filter(author == "EAP") %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  ungroup()
#jpeg(file="../figs/wordcloud_EAP.jpeg")

#wordcloud(spooky_wrd_EAP$word, spooky_wrd_EAP$n,
#          max.words = 200, scale = c(2.0,0.5),
#          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:10])
#dev.off()
wordcloud(spooky_wrd_EAP$word, spooky_wrd_EAP$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:10])

As for Edgar Allan Poe, it seems that he is more concerned about the time. Maybe he is the kind of writer who is good at creating the atmosphere of emergencies.

spooky_wrd_HPL <- spooky_wrd %>%
  filter(author == "HPL") %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  ungroup()
#jpeg(file="../figs/wordcloud_HPL.jpeg")

#wordcloud(spooky_wrd_HPL$word, spooky_wrd_HPL$n,
#          max.words = 200, scale = c(2.0,0.5),
#          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:10])
#dev.off()
wordcloud(spooky_wrd_HPL$word, spooky_wrd_HPL$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:10])


Well, HP Lovecraft is different from the other two that he prefers the night. Perhaps, his story happens mostly at night.

Similarity of authors

simi_author <- spooky_wrd %>%
   spread(author, n, fill = 0) %>%
   na.omit()

simi_word <- spooky_wrd %>%
   spread(word, n, fill = 2) %>%
   na.omit()
#jpeg(file="../figs/corrgram_simi.jpeg")

#corrgram(simi_author, lower.panel = panel.shade, upper.panel = panel.cor)
#dev.off()
corrgram(simi_author, lower.panel = panel.shade, upper.panel = panel.cor)

By comparing the words used by each author, we now see that these horror fiction authors are alike in words. Among them,the most likelihood happens between HP Lovecraft and Edgar Allan Poe, even the least likelihood in words amounts to 0.56. By ordering the likelihood, Edgar and HP Lovecraft are the most alike authors, Edgar and Mary are the second likelihhod, Mary and HP Lovecraft is the least likelihood.

It will also be interesting to see whether there is any major difference among the authors about the genders emphasis of the fictions.

p  <- spookydata %>%
  unnest_tokens(word, text) %>%
  
  mutate(female = ( word == "she" | word == "her" | word == "hers" | word == "female" | word == "girl"|
                    word == "woman" | word == "lady" | word == "madam" |
                    word == "women" )) %>%
  mutate(male = ( word == "he" | word == "him" | word == "his" | word == "male" |word == "boy"|
                    word == "man" | word == "gentleman" | word == "sir" |
                    word == "lord" | word == "men" )) %>%
  unite(sex, female, male) %>%
  mutate(sex = fct_recode(as.factor(sex), male = "FALSE_TRUE", 
                          female = "TRUE_FALSE", other = "FALSE_FALSE")) %>%
  filter(sex != "other") %>%
  ggplot(aes(sex, fill = author)) +
  labs(x = "Gender Difference") +
  geom_bar(position = "dodge")

#jpeg(file="../figs/compare_fema.jpeg")

#plot(p)
#dev.off()
plot(p)


Apparently, male characters are the leading gender in the horror fictions. Separately, HP Lovecraft is much more focused on the male characters than the other authors. Mary, as a female author, pays more attention to the female characters than the two other male authors. Then, what exactly words did they use as gender indications. Let’s find it out.

p1 <- spookydata %>%
  unnest_tokens(word, text) %>%
  filter((word == "him") | (word == "his") | (word == "man") | (word == "gentleman") | (word == "male") | (word == "lord") |(word == "boy")| (word == "he") | (word == "men")) %>%
  ggplot(aes(word, fill = author)) +
  geom_bar(position = "dodge")

p2 <- spookydata %>%
  unnest_tokens(word, text) %>%
  filter(( word == "she" | word == "her" | word == "hers" | word == "female" | word == "girl"|
                    word == "woman" | word == "lady" | word == "madam" |
                    word == "women" )) %>%
  ggplot(aes(word, fill = author)) +
  geom_bar(position = "dodge")

layout <- matrix(c(1,2),2,1,byrow=TRUE)

#jpeg(file="../figs/compare_fema_1.jpeg")
#multiplot(p1, p2, layout=layout)
#dev.off()
multiplot(p1, p2, layout=layout)
## Loading required package: grid

Apparently, the simple words“he, him, his” and “her”,“she” are the most common words in the charts. And it is of no apparent difference among the authors about the preference of the words in each gender.

t_heshe <- spookydata %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "genderword"), sep = " ")


p_heshe<-t_heshe %>%
  filter(word1 == "she" | word1 == "he" ) %>%
  group_by(author, word1, genderword) %>%
  count() %>%
  ungroup() %>%
  group_by(author, word1) %>%
  top_n(5,n) %>%
  ggplot(aes(genderword, n, fill = author)) +
  geom_col() +
  scale_y_log10() +
  coord_flip() +
  facet_grid(word1 ~ author)
#jpeg(file="../figs/compare_fema_2.jpeg")

#plot(p_heshe)

#dev.off()
plot(p_heshe)

t_himh <- spookydata %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "genderword"), sep = " ")


p_himh<-t_himh %>%
  filter(word1 == "her" | word1 == "him" ) %>%
  group_by(author, word1, genderword) %>%
  count() %>%
  ungroup() %>%
  group_by(author, word1) %>%
  top_n(5,n) %>%
  ggplot(aes(genderword, n, fill = author)) +
  geom_col() +
  scale_y_log10() +
  coord_flip() +
  facet_grid(word1 ~ author)
#jpeg(file="../figs/compare_fema_3.jpeg")

#plot(p_himh)
#dev.off()
plot(p_himh)


It is easy to see that Lovecraft is apparently less interested in using “she” and “her” than the other authors. And EAP and MWS are quite alike in using words associated with “she”,“he”,“her”and “him”. We may infer that Lovecraft’ writing styles in characters are different from the other two while the other two are quite alike.

Find the tf-idf within the authors

It is natural to expect the fictions groups differing in terms of topic and content, therefore the frequency of words will be different among them.

# first step is to tokenize the sentence, and we do lemmatize the words in the same time
usenet_words <- lemmatize_words(spookydata) %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)
#second get the words grouped by the authors.
words_by_author <- usenet_words %>%
  count(author, word, sort = TRUE) %>%
  ungroup()

words_by_author

Then, we generate the tf_idf matrix

tf_idf <- words_by_author %>%
  bind_tf_idf(word, author, n) %>%
  arrange(desc(tf_idf))

tf_idf
p_tf_idf<-tf_idf %>%
  group_by(author) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = author)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~author, scales = "free") +
  ylab("tf-idf") +
  coord_flip()
#jpeg(file="../figs/compare_tf_idf.jpeg")

#plot(p_tf_idf)
#dev.off()
plot(p_tf_idf)


It is clearly seen that the authors have different emphasis on different words. Then, what is the similarities among them? We could discover that by finding the pairwise correlation of word frequencies within each author.

author_cors <- words_by_author %>%
  pairwise_cor(author, word, n, sort = TRUE)
set.seed(2017)

p_author<-author_cors %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, width = correlation)) +
  geom_node_point(size =10, color = "lightblue") +
  geom_node_text(aes(label = name),color='red' ,repel = TRUE) +
  theme_void()
#jpeg(file="../figs/simi_author_idf.jpeg")

#plot(p_author)
#dev.off()
plot(p_author)


It is clear that the tf-idf analysis coincides with the previous number counts methods in comparing the writing styles.Edgar is likely both the other writers, while the other two is less likely.

Sentiment Analysis Part

After exploring the words, let’s start the sentiment analysis. ## Emotion

spooky_wrd <- lemmatize_words(spookydata) %>% unnest_tokens(word, text)%>%
  anti_join(stop_words, by = "word")

Let’s see the sentiment analysis among the different authors.

pic_wrd_mws <-spooky_wrd %>%
    filter(author == "MWS") %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(8, n) %>%
  mutate(word = reorder(word, n)) %>%
  ungroup() %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip()+
  labs( x = NULL,y = "Sentiment Analysis") +
  ggtitle("Mary Shelley: Negative  Positive Words")
#jpeg(file="../figs/senti_MWS.jpeg")

#plot(pic_wrd_mws)
#dev.off()
plot(pic_wrd_mws)


As for Mary, she is mostly focused on the negative, positive, uncertainty words. And specifically, as for the negative words, she likes “fear”, “lost” and “poor”.Her top 3 positive words are “happiness”,“happy”,“pleasure”. Her top 3 positive words are “appeared”,“suddenly”,“unknown”.

pic_wrd_hpl<-spooky_wrd %>%
    filter(author == "HPL") %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(8, n) %>%
  mutate(word = reorder(word, n)) %>%
  ungroup() %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip()+
  labs( x = NULL,y = "Sentiment Analysis") +
  ggtitle("H P Lovecraft: Negative  Positive Words")
#jpeg(file="../figs/senti_HPL.jpeg")

#plot(pic_wrd_hpl)
#dev.off()
plot(pic_wrd_hpl)

Similarly, HP Lovecraft is very much like Mary, mainly focusing on the negative, positive and uncertainty words. Also, his top three negative words are: “fear”,“lost” and “recall”. Surprisingly, he shares the “fear” and “lost” with Mary. While for the positive words, his top three words are “dream”,“leading”,“fantastic”. His top three words in uncertainty are “unknown”,“suddenly” and “appeared”.

pic_wrd_eap<-spooky_wrd %>%
    filter(author == "EAP") %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(8, n) %>%
  mutate(word = reorder(word, n)) %>%
  ungroup() %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip()+
  labs( x = NULL,y = "Sentiment Analysis") +
  ggtitle("E A Poe: Negative  Positive Words")
#jpeg(file="../figs/senti_EAP.jpeg")

#plot(pic_wrd_eap)
#dev.off()
plot(pic_wrd_eap)

Lastly, E A Poe also mainly focuses on the negative, positive and uncertainty words. Also, his top three negative words are: “doubt”,“question” and “difficulty”. Surprisingly, he differs from the previous two authors in negative words. While for the positive words, his top three words are “beautiful”,“easily”,“excited”. His top three words in uncertainty are “doubt”,“appeared” and “suddenly”.

In all, as for the sentiment words, all the authors focuses on the words “positive” “negative” and “uncertainty” while Mary and Lovecraft share favorite some negative words. E A Poe, however uses much different words from the other two authors.

Then, we may visualize the sentiment clustering word cloud as below.

#jpeg(file="../figs/sent_wordcloud.jpeg")

spooky_wrd %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"), max.words = 500)

#dev.off()

The super negative Index

To further anlayze the sentiment among the fictions, we could do some numerical analysis. Define the super negative Index = (#uncertainty+#Negative + #constraining)/(#Postive + #Negative + #constraining + #uncertainty)

pic1 <- spooky_wrd %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  ggplot(aes(author, fill = sentiment)) +
  geom_bar(position = "fill")

pic2 <- spooky_wrd %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  group_by(author, id, sentiment) %>%
  count() %>%
  spread(sentiment, n, fill = 0) %>%
  group_by(author, id) %>%
  summarise(neg = sum(negative),
            con = sum(constraining),
            unc = sum(uncertainty),
            pos = sum(positive)) %>%
  arrange(id) %>%
  mutate(frac_neg = 1 - pos/(pos + neg + con+unc)) %>%
  ggplot(aes(frac_neg, fill = author)) +
  geom_density(bw = .3, alpha = 0.5) +
  theme(legend.position = "right") +
  labs(x = "d")

layout <- matrix(c(1,2),1,2,byrow=TRUE)

#jpeg(file="../figs/super_negative_index.jpeg")
#multiplot(pic1, pic2, layout=layout)
#dev.off()
multiplot(pic1, pic2, layout=layout)


This picture directly reveals the sentiment distribution among the authors. All of the authors focuse most on negative, then positive and uncertainty. While the two male authors’ focus on uncertainty words are more than the female author: Mary.

N-gram Analysis

We have been considering the single words statistics analysis for the fictions. It is interesting to do the n-gram analysis.

As usual, we do the lemmatizing and remove the stopwords.

usenet_bigrams <- lemmatize_words(spookydata) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)%>%
  separate(bigram, c("word1", "word2"), sep = " ")%>%
filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)%>%
    unite(bigram, word1, word2, sep = " ")


usenet_bigram_counts <- usenet_bigrams %>%
  count(author, bigram, sort = TRUE) %>%
  ungroup() %>%
  separate(bigram, c("word1", "word2"), sep = " ")
usenet_bigram_counts
bigram_tf_idf <- usenet_bigrams %>%
  count(author, bigram) %>%
  bind_tf_idf(bigram, author, n) %>%
  arrange(desc(tf_idf))
#jpeg(file="../figs/tf_idf_authors.jpeg")

bigram_tf_idf %>%
  group_by(bigram_tf_idf$author)%>%
    top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, tf_idf)) %>%
  ggplot(aes(bigram, tf_idf, fill = author)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~author, scales = "free") +
  ylab("tf-idf") +
  coord_flip()

#dev.off()

As we can see, Lovecraft and EAP used the words like “ha ha” and “heh heh” most, while Mary pays little attention to them. It may be a major different between the male authors and female authors of horror fictions.

Topic Models

First, let’s explore the structure using Latent Dirichlet Allocation Moddel(LDA). This method yields an unsupervised classifictaion of documents. By seeking the clusters corresponding to different topics, we will be able to find underlting structure in the data.

# divide into documents, each representing one chapter
by_chapter <- spookydata %>%
  group_by(author) %>%
  #mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() #%>%
  #filter(chapter > 0) %>%
  #unite(document, author, text)

# split into words
by_chapter_word <- by_chapter %>%
  unnest_tokens(word, text)

# find document-word counts
word_counts <- by_chapter_word %>%
  anti_join(stop_words) %>%
  count(author, word, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
word_counts
chapters_dtm <- word_counts %>%
  cast_dtm(author, word, n)

chapters_dtm
## <<DocumentTermMatrix (documents: 3, terms: 24949)>>
## Non-/sparse entries: 40182/34665
## Sparsity           : 46%
## Maximal term length: 19
## Weighting          : term frequency (tf)

I use the LDA model for topic modelling with potential 10 topics.And save it to the output folder.

k <- 10
chapters_lda <- LDA(chapters_dtm, k = 10,method = "Gibbs", control = list(seed = 1234))

chapters_lda
## A LDA_Gibbs topic model with 10 topics.
chapter_topics <- tidy(chapters_lda, matrix = "beta")
chapter_topics
top_terms <- chapter_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
pic_topic<-top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
#jpeg(file="../figs/topics.jpeg")
#plot(pic_topic)
#dev.off()
plot(pic_topic)


We can now see the ten topics, and the 5 most frequent words in each topic.

chapters_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_gamma
chapters_gamma <- chapters_gamma %>%
  separate(document, c("author"), sep = "_", convert = TRUE)

chapters_gamma
topic_a<-chapters_gamma %>%
  mutate(author = reorder(author, gamma * topic)) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ author)
#jpeg(file="../figs/topics_author.jpeg")
#plot(topic_a)
#dev.off()
plot(topic_a)


Clearly, in the topic 1 which describes the atmosphere and envronment, Lovecraft is more focused on it than the other two authors.While Edgar focuses on topic 2 and Mary focuses on topic 9. The major differences revealed in this plot best illustrates the difference among the authors.

chapter_classifications <- chapters_gamma %>%
  group_by(author) %>%
  top_n(1, gamma) %>%
  ungroup()

chapter_classifications
book_topics <- chapter_classifications %>%
  count(author, topic) %>%
  group_by(author) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = author, topic)

assignments <- augment(chapters_lda, data = chapters_dtm)
assignments <- assignments %>%
  separate(document, c("author"), sep = "_", convert = TRUE) %>%
  inner_join(book_topics, by = c(".topic" = "topic"))
heat_map<-assignments %>%
  count(author, consensus, wt = count) %>%
  group_by(author) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, author, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", label = percent_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Topics assigned to",
       y = "Topics came from",
       fill = "% of topics")
#jpeg(file="../figs/topics_heatmap.jpeg")
#plot(heat_map)
#dev.off()
plot(heat_map)


It is not hard to see, by the measurement of topics, these authors basically don’t have much in common. So it would be wise to classify the fictions according to the topics.

In all, the three authors vary in the wrting styles in many aspects including the gender, the sentence length, the focused topic and the words focus. However, the three authors are much alike each other in the distribution of emotions of the words. This may be viewed as a pattern for horror fictions.

References:

1.https://www.kaggle.com/headsortails/treemap-house-of-horror-spooky-eda-lda-features
2. https://www.tidytextmining.com/